Despite engaging in one of the most primitive and individual athletic pursuits, distance runners accumulate an enormous amount of data. Across numerous distances, racing is strategic; however, the bulk of an athlete's preparation is devoted to building stamina, aerobic capacity and strength. At any level, improvement stems from stressing the body to ultimately achieve a stronger, fitter recovered state. But how much stress can the body handle, and how much recovery is needed? The greatest hindrance to these athletes is injury. This ultimately represents an optimisation problem - athletes wish to maximise the fitness derived from training stress whilst balancing the increasing risk of injury. Countless factors influence training stress and recovery - including genetic, biomechanical and lifestyle factors. With the provided, easily extractable data, this paper will attempt to produce a model that predicts injury risk. At the very least, it will grapple with the considerations involved.
Strava is the largest social fitness network in the world. The platform tracks GPS activities, offering distance, elevation and pace statistics. The explosion in fitness watch technology has complemented the platform, providing heart rate and cadence statistics. Although not a professional or high-level athlete in any capacity, I have accumulated almost 3 years of fairly consistent training data. More importantly, I have complete access to this data without any restrictions. Fortunately, a friend of mine (let them be called 'X') also volunteered their data for these purposes. The data was extracted from the Strava API using Python and saved as the attached CSV files. I'll illustrate the data exploration, and any required cleansing, on my own data.
Reference: https://www.strava.com/athletes/24203613
Firstly, the data is imported and non-running activities (e.g. bike rides, swims, workouts) are excluded.
# import data
df <- read.csv("/Users/apahljina/Documents/University of Arizona 2020/Sports Analytics/pahljina_stravadata_april20")
# extracting all running activities
dfrun <- df[which(df$type == 'Run'),]
Now, just a quick aside. There are several assumptions made in this paper to simplify the application of this model, and I will detail them as they arise.
Firstly, in excluding non-running activities I've chosen to quantify the training and recovery benefits of running activities only. Training stress can be derived from non-running activities, and evidence suggests they may aid recovery too. However, the statistics on these activities obviously vary, there have been many instances where I have not recorded non-running activities on this platform, and I'd need scientific backing to quantify their stress and/or recovery benefits. Thus, for simplicity, they have been excluded.
length(which(df$type != 'Run')) # the number of activities to be excluded from this model
## [1] 63
Now, the CSV file contains a number of irrelevant features for each activity, so I chose to extract the following:
# collating most relevant statistics
dfrun <- dfrun[,c('X','upload_id','name','distance','moving_time','elapsed_time', 'total_elevation_gain', 'elev_high', 'start_date', 'start_date_local', 'timezone', 'average_speed', 'max_speed', 'has_heartrate', 'average_heartrate', 'max_heartrate','average_cadence')]
For clarification, I will quickly define what each of these features represents:

- X: index
- upload_id: unique ID for activities synced from a fitness watch or Strava's inbuilt tracker
- name: the caption selected for the activity
- distance: in metres
- moving_time: the time spent at a detectable pace
- elapsed_time: the difference between the time at the start of the activity and the time of completion
- total_elevation_gain: in metres
- elev_high: highest altitude reached in the activity
- start_date: date of the activity
- start_date_local: date and time in the timezone of the activity's location
- timezone: self-explanatory
- average_speed, max_speed: in metres per second
- has_heartrate: a boolean value detailing whether the activity contains heart rate data
- average_heartrate, max_heartrate: beats per minute
- average_cadence: strides per minute (*taken modulo 100)
Converting the time data to a usable type in R, and adjusting the speed units, we obtain:
# manipulation of time values
dfrun[, c('start_date')] <- sapply(dfrun[, c('start_date')], as.character) # change to character value
dfrun[, c('start_date')] <- substr(dfrun[, c('start_date')],1,10) # remove additional characters
dfrun[, c('start_date')] <- as.Date(dfrun[, c('start_date')]) # convert to date value
## Warning in strptime(xx, f <- "%Y-%m-%d", tz = "GMT"): unknown timezone 'zone/tz/
## 2020a.1.0/zoneinfo/Australia/Sydney'
# conversion of speed to min/km
dfrun[, c('average_speed')] <- 1000/(dfrun[, c('average_speed')]*(60))
dfrun[, c('max_speed')] <- 1000/(dfrun[, c('max_speed')]*(60))
# show the resulting dataframe
head(dfrun, 5)
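As a quick sanity check on the unit conversion (using a hypothetical speed value): since average_speed arrives in metres per second, 1000/(speed * 60) yields pace in minutes per kilometre.

```r
# sanity check with a hypothetical speed value:
# 10/3 m/s (i.e. 12 km/h) should correspond to a 5:00 min/km pace
speed_ms <- 10/3
pace_min_km <- 1000/(speed_ms * 60)
pace_min_km  # 5
```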
To give a visual overview of these runs, consider:
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggplot2)
g <- ggplot() +
geom_point(aes(x = start_date, y = distance, label = name), colour = "red", size = 0.3, data = dfrun) + xlab("Date") + ylab("Distance (m)")
## Warning: Ignoring unknown aesthetics: label
ggplotly(g)
My first goal was to quantify the intensity of an individual activity.
Several assumptions were made when selecting the statistics most influential to intensity; they are embedded in the steps below. Firstly, heart rate was excluded from the metric, since a substantial number of activities lack heart rate data:
length(which(df$has_heartrate != 'True')) # the number of activities without heart rate data
## [1] 152
# check for N/A values, and patch a known bad elapsed_time entry
which(is.na(dfrun$elapsed_time))
## integer(0)
dfrun$elapsed_time[522] <- dfrun$moving_time[522]
## Warning in `[<-.factor`(`*tmp*`, 522, value = structure(c(646L, 612L, 251L, :
## invalid factor level, NA generated
# convert to time objects
library('lubridate')
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
dfrun$moving_time <- hms(dfrun$moving_time)
dfrun$elapsed_time <- hms(dfrun$elapsed_time)
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings failed to
## parse, or all strings are NAs
# ratio
dfrun$ratio = as.double(dfrun$moving_time)/as.double(dfrun$elapsed_time)
With these assumptions, I formulated a rather basic metric of intensity: ratio*(elevation gain + distance)/pace. Intuitively this makes sense, since a steep, long and fast run with no rest maximises the metric. I then considered that training stress must be relative to the athlete's ability - i.e. an effort that is intense for a beginner would probably have no impact on a seasoned or professional runner. To normalise these statistics, I considered distance and elevation gain as fractions of the maximum distance and elevation gain, and pace as a multiple of the fastest average pace. With some adjusted weighting, I obtained a metric I felt captured the relative intensity of these activities. The metric has been scaled so that intensities fall roughly on a 0-100 scale, for readability.
dfrun$intensity = 1000*dfrun$ratio*(((0.3*dfrun$distance/max(dfrun$distance)) + 0.6*(dfrun$total_elevation_gain/max(dfrun$total_elevation_gain)))/((dfrun$average_speed)/min(dfrun$average_speed))**3)
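To make the weighting concrete, here is the metric evaluated on a single hypothetical activity (all input fractions are invented for illustration): half the maximum distance, 20% of the maximum elevation gain, a pace 1.4 times the fastest average pace, and a moving/elapsed ratio of 0.95.

```r
# hypothetical normalised inputs for one activity
ratio      <- 0.95  # moving_time / elapsed_time
dist_frac  <- 0.5   # distance / max(distance)
elev_frac  <- 0.2   # total_elevation_gain / max(total_elevation_gain)
pace_ratio <- 1.4   # average_speed / min(average_speed), i.e. pace relative to fastest

intensity <- 1000 * ratio * (0.3*dist_frac + 0.6*elev_frac) / pace_ratio^3
intensity  # ~93.5: a hard, but not maximal, session under this metric
```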
If we have a look at the activities with the highest intensity under this metric, they are all race efforts, predominantly over longer distances (14km to 100km):
head(dfrun[order(-dfrun$intensity),], 5)
My next task was to quantify the recovery between activities, to ultimately gauge how training stress may accumulate. For simplicity, I have assumed time is the only variable of recovery. I have modelled recovery as a simple exponential decay function, illustrated below. Under this model, an activity close to the absolute maximum intensity (~100) would take approximately 2 weeks to completely recover from. The exponential function reflects the immediate recovery gains after an intense activity, but also the delay in achieving complete recovery.
# defining a decay function
t = (1:400)
plot(x = t,y = 100*exp(-0.009*t), main = "recovery function", xlab = "hours", ylab = "intensity")
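The two-week claim can be checked by inverting the decay curve. Treating recovery as complete once intensity has dropped to 5% of its starting value (the 5% threshold is my own assumption here, since the curve never strictly reaches zero):

```r
# invert 100*exp(-0.009*t) = 5 for t
t_recover <- -log(5/100)/0.009  # hours
t_recover/24                    # ~13.9 days, i.e. roughly two weeks
```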
To apply this recovery function, I first needed to calculate the time difference in hours between consecutive activities:
# reformatting
dfrun[, c('start_date_local')] <- substr(dfrun[, c('start_date_local')],12,19) # remove additional characters
# concatenate date and time as objects
dfrun$POSIX <- paste(dfrun$start_date, dfrun$start_date_local)
dfrun$time <- as.POSIXct(dfrun$POSIX)
dfrun <- dfrun[order(-dfrun$X),] # ordering
# finding the time difference between consecutive activities
i = 2
while (i < (dim(dfrun)[1]+1)){
  dfrun$timediff[i] = as.numeric(difftime(dfrun$POSIX[i],dfrun$POSIX[i-1], units = "hours"))
  i = i + 1
}
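As an aside, the same column can be produced without an explicit loop. A sketch on hypothetical timestamps, since diff() on a POSIXct vector returns the consecutive gaps directly:

```r
# hypothetical, chronologically ordered start times
times <- as.POSIXct(c("2020-04-01 07:00:00",
                      "2020-04-02 18:30:00",
                      "2020-04-04 06:00:00"), tz = "UTC")
# NA for the first activity, then the gaps in hours
timediff <- c(NA, as.numeric(diff(times), units = "hours"))
timediff  # NA 35.5 35.5
```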
We see that the largest time between recorded activities is 552 hours (~23 days), which occurred following an injury.
head(dfrun[order(-dfrun$timediff),c('name', 'distance', 'timediff')],1)
In applying the recovery function, the following steps have been iterated:

(1) Consider the intensity of the previous activity along the above recovery curve.
(2) Calculate the recovery gains relative to this intensity level and the time that has passed.
(3) Add the 'left over' intensity to the intensity of the current activity, to obtain the training stress level at a specified point in time.
# recovery function
recovery_decay <- function(dfrun) {
  i = 2
  while (i < (dim(dfrun)[1]+1)) {
    # invert the decay curve: the 'time' corresponding to the previous intensity
    reltime = (-1000/9)*log(dfrun$intensity[i-1]/100)
    addtime = reltime + dfrun$timediff[i]
    # left-over stress after the gap, plus the current activity's intensity
    dfrun$accumintensity[i] = 100*exp(-0.009*addtime) + dfrun$intensity[i]
    i = i + 1
  }
  return(dfrun$accumintensity)
}
# apply function
dfrun$accumintensity <- recovery_decay(dfrun)
# manually set the accumulated intensity of the first activity to be the intensity of that activity
dfrun$accumintensity[1] = dfrun$intensity[1]
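To illustrate the accumulation step on hand-picked numbers (all hypothetical): an activity of intensity 50, followed 48 hours later by one of intensity 30.

```r
prev <- 50; curr <- 30; gap <- 48            # hypothetical values
reltime  <- (-1000/9)*log(prev/100)          # ~77 hours 'into' the decay curve
leftover <- 100*exp(-0.009*(reltime + gap))  # stress remaining after the 48 h gap
accum    <- leftover + curr                  # accumulated intensity: ~62.5
```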
To smooth out short-term fluctuations and capture the longer term trend in training stress, I applied a moving average to the accumulated intensity. This is somewhat biased towards predicting injuries that occur due to consistent overloading as opposed to injuries that arise from acutely intense training events. This will be discussed in further detail later in the paper.
# rolling mean function: trailing mean over the current and up to 8 previous activities
rolling_average <- function(dfrun) {
  i = 1
  while (i < (dim(dfrun)[1]+1)){
    sum = 0
    count = 0
    for (j in (0:8)) {
      if (i-j >= 1) {
        sum = sum + dfrun$accumintensity[i-j]
        count = count + 1
      }
    }
    dfrun$rolmean[i] = sum/count
    i = i+1
  }
  return(dfrun$rolmean)
}
dfrun$rolmean <- rolling_average(dfrun) # apply function
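The same trailing mean (current activity plus up to 8 previous, with partial windows at the start) can be written without explicit loops. A base-R sketch:

```r
# trailing mean over up to `width` points, with partial windows at the start
trailing_mean <- function(x, width = 9) {
  vapply(seq_along(x), function(i) mean(x[max(1, i - width + 1):i]), numeric(1))
}
trailing_mean(c(10, 20, 30, 40))  # 10 15 20 25
```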
I manually inputted the dates and descriptions of my injuries to see how they align with the model's predictions. As I alluded to in my introduction, there is a fine line between increasing fitness and risking injury, so I was also interested to see where my fitness has peaked, according to the dates of my personal best times.
# best efforts
best.dates <- c('2019-09-25', '2019-09-11', '2019-09-15','2019-09-15','2018-09-16')
best.dates <- as.Date(best.dates)
dummy <- c(50,50,50,50,50)
best.distances <- c('1 km','1 mile', '5 km', '10 km', 'Half Marathon')
best.efforts <- data.frame(best.dates,dummy, best.distances)
# injuries
inj.dates = c('2020-01-15', '2019-09-25', '2019-05-19')
inj.dates <- as.Date(inj.dates)
inj.desc = c('foot pain', 'stress reaction', 'ITB')
dummy2 <- c(50,50,50)
injuries <- data.frame(inj.dates,dummy2, inj.desc)
I extracted the activities with the highest accumulated intensity as points where injury is likely to occur.
# activities with highest accumulated intensity
risk <- head(dfrun[(order(-dfrun$rolmean)),],10)
Finally, we can visualise the accumulated intensity of my training (in red), the points where injury is most likely to occur (in yellow), the activities where I ran personal bests (in blue) and the dates of injuries (in green).
library(plotly)
g <- ggplot() +
geom_segment(aes(x = start_date, xend = start_date, y = 0, yend = rolmean, label = name),colour = "red", data = dfrun) +
geom_segment(aes(x = start_date, xend = start_date, y = 0, yend = rolmean, label = name),colour = "yellow", data = risk) +
geom_point(aes( x = start_date, y = rolmean, label = name), colour = "red", size = 0.3, data = dfrun) +
geom_segment(aes(x = best.dates, xend = best.dates, y = 0, yend = dummy, label = best.distances), colour = "blue", data = best.efforts) +
geom_segment(aes(x = inj.dates, xend = inj.dates, y = 0, yend = dummy2, label = inj.desc), colour = "green", data = injuries)
## Warning: Ignoring unknown aesthetics: label
## Warning: Ignoring unknown aesthetics: label
## Warning: Ignoring unknown aesthetics: label
## Warning: Ignoring unknown aesthetics: label
## Warning: Ignoring unknown aesthetics: label
ggplotly(g)
Before I delve into some analysis of this plot, I'd like to apply this methodology to X's data. This athlete has a longer training history than me, but conveniently started recording with Strava around the same time; their data spans until late 2019. I was interested to see how transferable this model is to a different athlete. For their privacy, I have removed the names from their activities in the constructed dataframe.
# import data
df <- read.csv("/Users/apahljina/Documents/University of Arizona 2020/Sports Analytics/X_data")
dfrun <- df[which(df$type == 'Run'),] # extracting all running activities
dfrun <- dfrun[,c('X','upload_id','name','distance','moving_time','elapsed_time', 'total_elevation_gain', 'elev_high', 'start_date', 'start_date_local', 'timezone', 'average_speed', 'max_speed', 'has_heartrate', 'average_heartrate', 'max_heartrate','average_cadence')]
# manipulation of time values
dfrun[, c('start_date')] <- sapply(dfrun[, c('start_date')], as.character) # change to character value
dfrun[, c('start_date')] <- substr(dfrun[, c('start_date')],1,10) # remove additional characters
dfrun[, c('start_date')] <- as.Date(dfrun[, c('start_date')]) # convert to date value
# conversion of speed to min/km
dfrun[, c('average_speed')] <- 1000/(dfrun[, c('average_speed')]*(60))
dfrun[, c('max_speed')] <- 1000/(dfrun[, c('max_speed')]*(60))
which(is.na(dfrun$elapsed_time))
## integer(0)
dfrun$elapsed_time[424] <- dfrun$moving_time[424]
## Warning in `[<-.factor`(`*tmp*`, 424, value = structure(c(599L, 338L, 411L, :
## invalid factor level, NA generated
# convert to time objects
dfrun$moving_time <- hms(dfrun$moving_time)
dfrun$elapsed_time <- hms(dfrun$elapsed_time)
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings failed to
## parse, or all strings are NAs
# ratio
dfrun$ratio = as.double(dfrun$moving_time)/as.double(dfrun$elapsed_time)
# intensity metric
dfrun$intensity = 1000*dfrun$ratio*(((0.3*dfrun$distance/max(dfrun$distance)) + 0.6*(dfrun$total_elevation_gain/max(dfrun$total_elevation_gain)))/((dfrun$average_speed)/min(dfrun$average_speed))**3)
dfrun[, c('start_date_local')] <- substr(dfrun[, c('start_date_local')],12,19) # remove additional characters
# concatenate date and time
dfrun$POSIX <- paste(dfrun$start_date, dfrun$start_date_local)
dfrun$time <- as.POSIXct(dfrun$POSIX)
# ordering
dfrun <- dfrun[order(-dfrun$X),]
# time between activities
i = 2
while (i < (dim(dfrun)[1]+1)){
dfrun$timediff[i] = as.numeric(difftime(dfrun$POSIX[i],dfrun$POSIX[i-1], units = "hours"))
i = i + 1
}
# accumulated intensity function
dfrun$accumintensity <- recovery_decay(dfrun)
dfrun$accumintensity[1] = dfrun$intensity[1] # set initial intensity
dfrun$rolmean <- rolling_average(dfrun) # rolling mean function
# take the top 10 activities by rolling mean
risk <- head(dfrun[(order(-dfrun$rolmean)),],10)
# best efforts
best.dates <- c('2019-12-12', '2019-06-01', '2019-11-10')
best.dates <- as.Date(best.dates)
dummy <- c(50,50,50)
best.distances <- c('3 km','5 km', '10 km')
best.efforts <- data.frame(best.dates,dummy, best.distances)
# injuries
inj.dates = c('2018-10-06', '2019-01-14')
inj.dates <- as.Date(inj.dates)
inj.desc = c('ITB and ankle', 'ITB')
dummy2 <- c(50,50)
injuries <- data.frame(inj.dates,dummy2, inj.desc)
g <- ggplot() +
geom_segment(aes(x = start_date, xend = start_date, y = 0, yend = rolmean),colour = "red", data = dfrun) +
geom_segment(aes(x = start_date, xend = start_date, y = 0, yend = rolmean),colour = "yellow", data = risk) +
geom_point(aes( x = start_date, y = rolmean), colour = "red", size = 0.3, data = dfrun) +
geom_segment(aes(x = best.dates, xend = best.dates, y = 0, yend = dummy, label = best.distances), colour = "blue", data = best.efforts) +
geom_segment(aes(x = inj.dates, xend = inj.dates, y = 0, yend = dummy2, label = inj.desc), colour = "green", data = injuries)
## Warning: Ignoring unknown aesthetics: label
## Warning: Ignoring unknown aesthetics: label
ggplotly(g)
In the first plot, the second chronological injury clearly aligns with a period of high training stress. This was a bone injury caused by overloading, and it in fact correlates with the highest level of accumulated training stress ever recorded in the model. However, if we consider my first injury and the injuries of athlete X, they were caused by single high-intensity races. The model detects the stress of the training period leading up to the race, but not the acute intensity of the race itself. Moreover, the model detects high injury risk in both plots where no injury actually occurred. Why this is the case is hard to say - perhaps recovery was occurring faster than modelled, or was aided by lifestyle factors, or perhaps the athlete was more adapted to this level of intensity at the time. Furthermore, the third injury in the first plot was largely attributed to lifestyle stress and thus isn't detected by the model. As discussed and expected, high stress is also conducive to personal bests, with blue markers close to or following periods of training stress. Ultimately, this model effectively mimics training stress and can be applied to varied athletes. Whilst the periods of high intensity do not always result in injury, it may still be informative to an athlete.
Many points of improvement are referenced in the paper; they are summarised here:

(1) This model is a simplification of reality - many of the assumptions made reflect knowledge within the sport but are intuitive and not directly backed by science.
(2) As discussed, there is scope to incorporate heart rate and altitude data into the intensity metric. There is also scope to broaden data collection to lifestyle, genetic and biomechanical factors, to customise this baseline model for different athletes. This would likely require far more scientific involvement, but would undoubtedly increase the model's accuracy.
(3) Obtaining per-kilometre or per-minute data would also be advantageous to the metric, particularly in understanding the intensity derived from speed. Considering the time spent at certain paces would give a greater indication of intensity than average pace.
(4) When calculating the time between activities, I did not adjust for timezones. The start time of each activity is recorded in the local timezone, so if consecutive activities are completed in different timezones the time difference will need to be adjusted. This is a rather minor consideration, and more applicable to some athletes than others. However, we see that even my own data contains activities across numerous timezones:
df <- read.csv("/Users/apahljina/Documents/University of Arizona 2020/Sports Analytics/pahljina_stravadata_april20")
dfrun <- df[which(df$type == 'Run'),] # extracting all running activities
unique(dfrun[, c('timezone')])
## [1] Australia/Sydney Europe/Amsterdam America/Phoenix
## [4] America/Denver America/Los_Angeles Australia/Hobart
## [7] Australia/Perth Australia/Adelaide Asia/Tokyo
## [10] Asia/Makassar Australia/Melbourne Australia/Brisbane
## [13] Europe/Rome Europe/Zurich Europe/Paris
## [16] Europe/Warsaw Indian/Maldives Asia/Kolkata
## 18 Levels: America/Denver America/Los_Angeles America/Phoenix ... Indian/Maldives
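On point (4), a base-R sketch with hypothetical timestamps: because POSIXct stores instants internally as UTC, parsing each start time in its own timezone (using the timezone column) makes difftime timezone-safe automatically.

```r
# hypothetical consecutive activities recorded in different timezones
t1 <- as.POSIXct("2020-01-10 07:00:00", tz = "Australia/Sydney")
t2 <- as.POSIXct("2020-01-10 22:00:00", tz = "Europe/Amsterdam")
# the naive local-clock difference would be 15 hours; the true gap is 25
gap_hours <- as.numeric(difftime(t2, t1, units = "hours"))
gap_hours  # 25
```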